Return-Path: <icon-group-sender>
Received: from kingfisher.CS.Arizona.EDU (kingfisher.CS.Arizona.EDU [192.12.69.239])
    by baskerville.CS.Arizona.EDU (8.9.1a/8.9.1) with SMTP id MAA26222
    for <icon-group-addresses@baskerville.CS.Arizona.EDU>; Thu, 10 Sep 1998 12:24:31 -0700 (MST)
Received: by kingfisher.CS.Arizona.EDU (5.65v4.0/1.1.8.2/08Nov94-0446PM)
    id AA31099; Thu, 10 Sep 1998 12:24:04 -0700
From: gep2@computek.net
Date: Thu, 10 Sep 1998 13:46:57 -0500 (CDT)
Message-Id: <199809101846.NAA18441@mail.cmpu.net>
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Subject: Unicode support or support for non-Ascii based character
    manipulation?
To: icon-group@optima.CS.Arizona.EDU
X-Mailer: SPRY Mail Version: 04.00.06.17
Errors-To: icon-group-errors@optima.CS.Arizona.EDU
Status: RO
> Icon has been a very interesting language for string manipulation,
Certainly! If not the MOST interesting language for such purposes.
> however, the limit of supporting only ASCII
Actually, that's not really true. Icon is much more free of "supporting ONLY
ASCII" than C, for example. (I don't know how true this is about things like
conversions... does Icon automatically support EBCDIC character assignments, for
example, if generated on an EBCDIC system?)
Certainly though there are issues that come up with supporting international
characters, and in part that's due to the fact that there doesn't seem to be any
real international agreement on how (at least some) other natural languages map
their alphabetic characters into (at least) an 8-bit byte. Hebrew is one
example, where there seem to be at least three different competing "standards"
for where the characters are mapped.
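As a rough illustration of that mapping problem, here is a small Python sketch (Python only because its standard codecs make the byte values easy to inspect). The three code pages are picked for the sketch and are an assumption; they may not be the three "standards" meant above. Two of them happen to agree on where the letters sit, while the DOS code page puts them somewhere else entirely.

```python
# Sketch: one Hebrew letter under several 8-bit code pages.
ALEF = "\u05d0"  # HEBREW LETTER ALEF

# Python's standard codec names. Treating these three as the competing
# mappings is an assumption made for illustration only.
CODE_PAGES = ["iso8859_8", "cp1255", "cp862"]


def byte_values(ch, codecs=CODE_PAGES):
    """Return {codec name: byte value} for a single character."""
    return {name: ch.encode(name)[0] for name in codecs}


mapping = byte_values(ALEF)
for name, value in mapping.items():
    print(f"{name:10s} 0x{value:02X}")

# A byte written under one table and read under another does not come
# back as the same letter.
misread = ALEF.encode("cp862").decode("iso8859_8", errors="replace")
print(misread == ALEF)
```

A program that only ever sees the bytes cannot tell which table was intended, so that information has to travel alongside the data somehow.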
> makes it less useful for non-English language work.
Well, I agree that it's perhaps less useful than it MIGHT be, but I still
suspect it's (far!) more useful than OTHER programming languages are for these
kinds of things.
> With the computer industry heading towards Unicode support,
Okay, I don't dispute that this move is happening but personally I still don't
very much like it. The fact is that (at least here in the Western Hemisphere,
where probably most of the world's computers are used) an eight-bit byte is
already quite sufficient for most purposes, and doubling it comes at a cost in
complexity and storage (RAM, disk, tape, whatever) which is simply very, very
hard to justify on any genuine economic basis. If other countries have more
difficult (or huge) character sets, that is (while a fact of life) simply an
inherent disadvantage of their culture (and note that I'm not intending that as
a slam or value judgement, it just IS the way it is), and I don't see a terribly
convincing argument why the other countries (without that disadvantage) ought to
pay the price too, just in order to artificially level the playing field.
> ...it should be possible to begin including support for
> non-English and non alphabetic languages.
I think that a lot of the basic manipulations and features in Icon (tables,
sets, etc) are probably insensitive to the character mapping used. And Icon
does seem to be pretty much (totally?) eight-bit clean (unlike C), which at
least gives one the ability to construct stuff on top of it to support other
languages.
One issue, of course, is the one I mentioned earlier... conversions. Numeric
formatting is another specific example of a potential problem area, since
certainly not all cultures prefer Arabic numerals.
Another issue, perhaps unique to Icon, is the implementation of the "character set"
datatype, which I'd suspect would end up being quite different for a character
repertoire of 65,536 distinct characters... since the character set data
representation, unless a different implementation technique were used, would
presumably be not twice but 256 times larger than for an eight-bit character
set.
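The arithmetic behind that "256 times" can be sketched quickly, assuming a character set is stored as a flat bitmap with one membership bit per possible character. That storage scheme is an assumption for the sketch, and the function name is made up for illustration.

```python
def cset_bitmap_bytes(bits_per_char):
    """Bytes needed for a bitmap holding one bit per possible character."""
    n_chars = 2 ** bits_per_char
    return n_chars // 8


small = cset_bitmap_bytes(8)    # 256 possible characters
large = cset_bitmap_bytes(16)   # 65,536 possible characters
ratio = large // small

print(small, large, ratio)
```

Under that assumption each character set value goes from 32 bytes to 8 KB, which is why some sparser representation would probably be wanted.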
I can certainly understand and appreciate the problems that the huge character
sets used in some eastern countries have posed for them, and frankly have been
surprised by the extent to which solutions for things like keyboards have been
worked out. And text processing with such large character sets certainly must
represent a whole series of unique challenges, so I can understand the interest
in those countries in something like Icon for attacking them.
> Has anyone thought about this yet? What does string and pattern matching
> mean in, for example, Japanese?
I have given the matter some thought, although just as an 'outside observer'. I
would presume that a "full/nice" implementation for such languages would result
in simply processing Unicode-like 16-bit characters, with everything that
involves. At *some* point, barring having complete 16-bit-byte uniformity
across everything from CPUs and operating systems to peripheral devices, there
might have to be some conversions and "glue" interface work done, and
classically it's at those border/edge regions that the seams tend to be less
than pretty.
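To make the "glue at the edges" point concrete, here is a hedged Python sketch. It assumes the outside world hands over Japanese text in Shift_JIS (my choice of example encoding, nothing from the original question) and shows why matching on raw bytes can mislead, and where the conversions would sit.

```python
# Bytes as they might arrive from a file or device (Shift_JIS is assumed).
RAW = "表".encode("shift_jis")

# Byte-level matching: the second byte of this character has the same
# value as an ASCII backslash, so a naive byte search "finds" one.
false_hit = b"\\" in RAW

# Decode at the edge, work on whole characters, encode on the way out.
text = RAW.decode("shift_jis")
true_hit = "\\" in text           # character-level search finds none
wide = text.encode("utf-16-le")   # a single 16-bit unit for this character
back = text.encode("shift_jis")   # round trip back to the external form

print(false_hit, true_hit, len(wide), back == RAW)
```

All the matching in the middle works on characters, and the only places that need to know about Shift_JIS are the decode on the way in and the encode on the way out, which is exactly where the seams would show.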
Certainly one of the more interesting Icon-related issues I've seen come up here
in a while. I seem to recall it was mentioned briefly some time ago (perhaps
that was on the SNOBOL4 list instead?) but didn't go very far at the time.
Gordon Peterson
http://www.computek.net/public/gep2/
Support the Anti-SPAM Amendment! Join at http://www.cauce.org/